Abstract

The nni-distance is a well-known distance measure for phylogenetic trees. We construct
an efficient parallel approximation algorithm for the nni-distance in the CRCW-PRAM
model running in $O(\log n)$ time on $O(n)$ processors. Given two phylogenetic trees $T_1$ and
$T_2$ on the same set of taxa and with the same multi-set of edge-weights, the algorithm
constructs a sequence of nni-operations of weight at most $O(\log n) \cdot \mathrm{opt}$, where opt denotes
the minimum weight of a sequence of nni-operations transforming $T_1$ into $T_2$. This algorithm
is based on the sequential approximation algorithm for the nni-distance given by DasGupta
et al. (2000). Furthermore, we show that the problem of identifying so-called good edge-pairs
between two weighted phylogenies can be solved in $O(\log n)$ time on $O(n \log n)$
processors.

1 Introduction 

Phylogenetic trees (or phylogenies) are a well-known model for the evolutionary history of species.
Such a tree represents the lineage of a set of today's species, or more generally a set of taxa, which
are located at the leaf level of the tree. The set of internal nodes and the topology describe the
ancestral history and interconnections among the taxa. Usually phylogenetic trees have internal
nodes of degree 3. A weighted phylogeny additionally imposes weights on its edges, representing
the evolutionary distance between two taxa or internal nodes. A phylogeny is called rooted if a
common eldest ancestor is known and designated as the root of the tree, and unrooted otherwise.

Concerning the reconstruction of phylogenetic trees from a given set of genetic data, a number
of different models and algorithms have been introduced over the past decades. Each method
is based on a different objective criterion or distance function in the course of construction,
for example parsimony, compatibility, distance and maximum likelihood. As a consequence, the
resulting phylogenies may vary in their internal topology and leaf configuration, although
they have been created over the same set of taxa. Hence it is reasonable to compare
different phylogenies for their similarities and discrepancies. Many different measures have
been proposed for this task as well, including subtree transfer metrics [AS01], minimum agreement
subtrees [FG85], et cetera.

In this paper we focus on a restricted subtree transfer measure to compare phylogenetic trees,
namely the nearest neighbor interchange distance (nni), which was introduced by D.F. Robinson
in [Rob71]. An nni-operation swaps two subtrees that are both adjacent to the same edge e in
the tree. See Figure 1 for an illustration of the nni-operation. The nni-distance between two
trees is the minimum number of nni-operations required to transform one tree into the other.
(a) possible nni-operations

Figure 1: The possible non-redundant nni-operations relative to an internal edge e = (u, v). Each
triangle A, B, C, D represents a subtree of the tree. The uniform cost of this operation is the weight wt(e)
of edge e.

1.1 Previous Results 

Although the nni-distance has a simple definition in terms of a transformation of subtrees, the 
efficient and fast computation turned out to be surprisingly challenging. 

For more than a decade after its introduction in 1971 by Robinson [Rob71], no efficient algorithm
for computing the nni-distance was known for practical (large) instances of phylogenetic
trees. Day and Brown [Day85] were the first to present an efficient approximation algorithm for
unweighted instances. The algorithm runs in $O(n \log n)$ time for unrooted and $O(n^2 \log n)$ time
for rooted instances.

Li, Tromp and Zhang [LTZ96] gave logarithmic lower and upper bounds on the maximum nni-distance
between arbitrary 3-regular trees. Furthermore, they gave an outline of a polynomial
time approximation algorithm for unweighted instances with approximation ratio $\log n + O(1)$.

DasGupta, He, Jiang, Li, Tromp and Zhang [DHJ+00] proved the NP-completeness of computing
the nni-distance on weighted and unweighted instances, and on trees with unlabeled (or
non-uniformly labeled) leaves. They gave an approximation algorithm with running time $O(n^2)$
and approximation ratio $4 \log n + 4$ for weighted instances. Furthermore, they observed that
the nni-distance is identical to the linear-cost subtree-transfer distance on unweighted phylogenies
[DHJ+99] and gave an outline of an exact algorithm for distance-restricted instances with
running time $O(n^2 \log n + n \cdot 2^{nd})$.

1.2 Our Work 

In this paper, we present an efficient parallel approximation algorithm for the nni-distance on
weighted phylogenies. This algorithm runs on a CRCW-PRAM in time $O(\log n)$ with $O(n \log n)$
processors and yields an approximation ratio of $O(\log n)$. It is based on the sequential approximation
algorithm by DasGupta et al. [DHJ+00] with running time $O(n^2)$ and approximation
ratio $4(1 + \log n)$. In particular, we obtain a CRCW-PRAM algorithm with time $O(\log n)$ and
$O(n)$ processors for the case when no good edge-pairs exist.

The paper is organized as follows. In Section 2 we give formal definitions of phylogenies
and the nni-distance. In Section 2.1 we describe the sequential approximation algorithm of
DasGupta et al. [DHJ+00]. In Section 3 we present our new parallel approximation algorithm,
which consists of efficient parallel algorithms for linearizing trees (Section 3.1), sorting edge-permutations
on linear trees (Section 3.2) and sorting leaf-permutations on binary balanced
trees (Section 3.3). Finally, in Section 3.4 we present an efficient parallel algorithm to identify
good edge-pairs between two phylogenetic trees, in order to be able to split up large instances and
distribute the computational task in a pre-computational step.

2 Preliminaries 

We will make use of the following notation. Let $T = (V, E)$ be an undirected or directed tree;
then $\mathcal{L}_T \subseteq V$ denotes the set of leaves of $T$ and $\mathcal{I}_T \subseteq V$ the set of internal vertices of $T$.
The most important primitives in phylogenetic analysis are taxa and phylogenies.

Definition 1. Given a finite set of taxa $S = \{s_1, \ldots, s_n\}$, a phylogeny for $S$ is a triplet
$T = (V, E, \lambda)$ where $(V, E)$ is an undirected tree, $\lambda : \mathcal{L}_T \to S$ is a bijection and every
internal node of $T$ has degree 3. A rooted phylogeny for $S$ is a tuple $T = (V, E, \lambda, r)$ such that
$(V, E, \lambda)$ is a phylogeny and $r \in V$ is the root of $T$. A weighted phylogeny for $S$ is a tuple
$T = (V, E, \lambda, \mathrm{wt})$ such that $(V, E, \lambda)$ is a phylogeny and $\mathrm{wt} : E \to \mathbb{R}^+$ is a weight function on
the set of edges of $T$. A rooted weighted phylogeny is a tuple $T = (V, E, \lambda, \mathrm{wt}, r)$ such that
$(V, E, \lambda, r)$ is a rooted phylogeny and $\mathrm{wt} : E \to \mathbb{R}^+$ is an edge-weight function.

The nni-distance is the minimum number of nearest neighbor interchanges (nni) needed in
order to transform one tree into another [RF79]:

Definition 2. Let $T$ be a phylogeny (possibly rooted and/or weighted) and let $e_1, e_2, e_3$ be three
edges of $T$ that build a path of length three in $T$ (in this order). The associated nni-operation,
denoted as a triplet $(e_1, e_2, e_3)$, transforms the tree $T$ into a new tree $T'$ by swapping the two
subtrees below the edges $e_1$ and $e_3$ as shown in Figure 2. In this configuration we call the
center edge $e_2$ the operating edge. In case of weighted phylogenies the cost of this nni-operation
is defined as $\mathrm{wt}(e_2)$.
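As an illustration of Definition 2, the following Python sketch performs a single nni-operation on an unrooted tree stored as an adjacency map. The representation and the names `nni` and `example_tree` are assumptions made for this example and are not taken from the paper. The path a - u - v - b plays the role of the triplet (e1, e2, e3), with (u, v) as the operating edge; in the weighted case the cost of the move would be the weight of (u, v).

```python
def nni(adj, a, u, v, b):
    """Swap the subtrees hanging off a (at u) and b (at v) across edge (u, v).

    adj maps each node to the set of its neighbours (undirected tree).
    The path a - u - v - b corresponds to the edge triplet (e1, e2, e3).
    """
    assert u in adj[a] and v in adj[u] and b in adj[v], "a-u-v-b must be a path"
    # detach both subtrees from their current end of the operating edge
    adj[a].remove(u)
    adj[u].remove(a)
    adj[b].remove(v)
    adj[v].remove(b)
    # re-attach each subtree at the opposite end of the operating edge
    adj[a].add(v)
    adj[v].add(a)
    adj[b].add(u)
    adj[u].add(b)
    return adj


def example_tree():
    """Quartet tree (hypothetical example): leaves A, B at u and C, D at v."""
    edges = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj
```

Calling `nni(example_tree(), "A", "u", "v", "C")` exchanges the leaves A and C, so that u is adjacent to B and C afterwards while v is adjacent to A and D; all node degrees are preserved.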




Figure 2: The nni-operation on $T$ of the subtrees A and B defined by the triplet $(e_1, e_2, e_3)$.

The associated genetic distance measure is the nni-distance:

Definition 3. Let $S$ be a set of taxa and let $T_1, T_2$ be phylogenies for $S$. The nni-distance
$d_{\mathrm{nni}}(T_1, T_2)$ of $T_1, T_2$ is the minimum length of a sequence of nni-operations that transforms $T_1$
into $T_2$ (and $\infty$ in case no such sequence exists). In case of weighted phylogenies $d_{\mathrm{nni}}(T_1, T_2)$ is
the minimum cost of a sequence of nni-operations that transforms $T_1$ into $T_2$.

Given two weighted phylogenetic trees $T_i = (V_i, E_i, \lambda_i, \mathrm{wt}_i)$, $i = 1, 2$, for the same set of taxa
$S$, the following two conditions are necessary for the two trees to have a finite nni-distance.

1. For each taxon $s \in S$, let $e_i(s) \in E_i$ be the edge incident to the leaf with label $s$ in $T_i$
($i = 1, 2$). Then $e_1(s)$ and $e_2(s)$ must have the same edge weight: $\mathrm{wt}_1(e_1(s)) = \mathrm{wt}_2(e_2(s))$.

2. $M_1 = M_2$, where $M_i$ denotes the multiset of edge-weights of $T_i$.
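The two necessary conditions can be checked directly before any nni-sequence is computed. The sketch below is only an illustration; the function name and the input format (one dictionary from taxon to leaf-edge weight per tree, plus a list of all edge weights per tree) are assumptions of this example.

```python
from collections import Counter


def may_have_finite_nni_distance(leaf_wt1, leaf_wt2, all_wt1, all_wt2):
    """Check the two necessary conditions for a finite nni-distance.

    leaf_wt1, leaf_wt2: dict taxon -> weight of the edge incident to that leaf.
    all_wt1, all_wt2: iterables with the weights of all edges of each tree.
    """
    # Condition 1: same taxa, and every leaf edge has the same weight in both trees.
    same_leaf_edges = leaf_wt1 == leaf_wt2
    # Condition 2: the multisets of edge weights coincide.
    same_multiset = Counter(all_wt1) == Counter(all_wt2)
    return same_leaf_edges and same_multiset
```

The conditions are necessary but not sufficient, so a result of `True` only means that the instance is not rejected at this stage.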

In order to identify parts or subtrees of the tree that require a "large" or "small" amount of
work to be transformed into their counterparts from the other tree, the notion of good edge-pairs
and bad edges or non-shared edges according to the set of leaf-labels and edge-weights is used in
the literature (cf. [RF79], [DHJ+00]).

Definition 4. (Good Edge-Pairs, Bad Edges)

Let $T_1$ and $T_2$ be two weighted phylogenies for the set of taxa $S$. Two internal edges $e_i \in E_{T_1}$
and $e_j \in E_{T_2}$ form a good edge-pair if and only if the following conditions hold:

1. $\mathrm{wt}_1(e_i) = \mathrm{wt}_2(e_j)$.

2. Both edges induce the same partition of the multiset of edge-weights on $T_1$ and $T_2$.

3. Both edges induce the same partition of the set of leaf-labels on $T_1$ and $T_2$.

An edge $e_i \in E_{T_1}$ is called bad if there does not exist any edge $e_j \in E_{T_2}$ such that $(e_i, e_j)$ forms a
good edge-pair.

If $e_i$ and $e_j$ form a good edge-pair, no nni-move with operating edge $e_i$ is needed to transform
$T_1$ into $T_2$.
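The three conditions of Definition 4 can be tested naively once the partitions induced by each internal edge are known. The following sketch assumes that these partitions have already been computed and stored per edge; the record type `EdgeSplit` and the helper names are hypothetical and serve only to illustrate the definition. It is a quadratic-style baseline, not the parallel method of Section 3.4.

```python
from typing import NamedTuple


class EdgeSplit(NamedTuple):
    weight: float
    leaves: tuple   # (leaf labels on one side, leaf labels on the other side)
    weights: tuple  # (edge weights on one side, edge weights on the other side)


def canonical(split):
    """Order-independent form of the two partitions induced by an edge."""
    leaf_part = frozenset(frozenset(side) for side in split.leaves)
    weight_part = tuple(sorted(tuple(sorted(side)) for side in split.weights))
    return split.weight, leaf_part, weight_part


def is_good_pair(e1, e2):
    """Conditions 1-3 of Definition 4 for precomputed splits."""
    return canonical(e1) == canonical(e2)


def bad_edges(edges1, edges2):
    """Edges of the first tree that have no good partner in the second tree."""
    seen = {canonical(e) for e in edges2}
    return [e for e in edges1 if canonical(e) not in seen]
```

Whether the operating edge itself is counted on one side of the weight partition is a modelling choice left open here; it only has to be made consistently for both trees.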



2.1 DasGupta's Sequential Approximation Algorithm 



In this section we give an outline of DasGupta's approximation algorithm [DHJ+00] for the
nni-distance on weighted phylogenies on a set $S$ of $n$ taxa. For ease of notation we assume
that the phylogenies are rooted. Unless otherwise mentioned we will refer to these rooted and
weighted phylogenies on $S$ as phylogenies for short. Hence for the rest of this paper, a phylogeny
is always a rooted and weighted phylogeny $T = (V, E, \lambda, \mathrm{wt}, r)$.



Theorem 1. [DHJ+00] Let $T_1$ and $T_2$ be two phylogenies. Then $d_{\mathrm{nni}}(T_1, T_2)$ can be approximated
within $O(n^2)$ time and approximation ratio $4(1 + \log n)$.

Given two phylogenies $T_1, T_2$, at first the multisets of edge-weights of internal edges of both
$T_1$ and $T_2$ are sorted in $O(n \log n)$ time. In case these two multisets differ, $T_1$ and $T_2$ do not have
a finite nni-distance. Hence, from now on we assume that $\{w_1, w_2, \ldots, w_{n-3}\}$ is the multiset
of edge-weights of internal edges of both $T_1$ and $T_2$ and that $w_1 \le w_2 \le \cdots \le w_{n-3}$ holds.
Furthermore let $W := \sum_{i=1}^{n-3} w_i$ be the sum of all edge weights of internal edges of $T_i$, $i \in \{1, 2\}$.

Lemma 1. [DHJ+00] If $d_{\mathrm{nni}}(T_1, T_2) < \infty$ and $T_1$ and $T_2$ have no good edge-pairs, then
$d_{\mathrm{nni}}(T_1, T_2) \ge W$.

DasGupta's algorithm makes use of two different trees associated with each of the given phylogenies
$T_1, T_2$, which we call the auxiliary tree and the linear tree.

Let $T = (V, E, \lambda, \mathrm{wt}, r)$ be a phylogeny. An auxiliary tree $T' = (V, E', \lambda, \mathrm{wt}', r)$ is a phylogeny
on the same set of vertices $V$ and labeling of taxa $\lambda$ that has the following properties:

• all leaves $l, l' \in \mathcal{L}_{T'}$ are of balanced height, $|\mathrm{depth}_{T'}(l) - \mathrm{depth}_{T'}(l')| \le 1$,

• the multisets of edge-weights in the trees $T$ and $T'$ are the same, $M = M'$,

• the edge-weights of internal edges on every path from $r$ to a leaf in $T'$ are non-descending.

If the multiset $M$ of edge-weights is sorted such that $w_1 \le w_2 \le \cdots \le w_{n-3}$ holds, we achieve
the auxiliary tree property by arranging the edge-weights in $M$ on a binary balanced tree such
that, at level $i$, $w_{2^{i-1}+j}$ is the $j$-th edge-weight assigned to an edge from the left. DasGupta's
algorithm constructs auxiliary trees $T_i' = (V_i, E_i', \lambda_i, \mathrm{wt}_i', r_i)$, $i = 1, 2$, for $T_1$ and $T_2$. Then both
the original phylogenies $T_i$ and the associated auxiliary trees $T_i'$ are transformed into so-called
linear trees: For a given phylogeny $T = (V, E, \lambda, \mathrm{wt}, r)$, a linear tree $L_T = (V, E'', \lambda, \mathrm{wt}'', r)$ of
$T$ is a phylogeny with the same labeling $\lambda$ and such that every internal node is adjacent to at
least one leaf (cf. Figure 3).
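The reason why a level-by-level arrangement of the sorted weights yields non-descending root-to-leaf paths can be seen in a small sketch. The code below is an assumption-laden simplification: it considers only internal edges, numbers them 0, 1, 2, ... top-down and left-to-right below the root (so it does not reproduce the exact index formula of the text), and all function names are hypothetical.

```python
def assign_auxiliary_weights(weights):
    """Sorted weights in level order: index k is the k-th edge, top-down, left-to-right."""
    return sorted(weights)


def parent_edge(k):
    """Index of the edge directly above edge k, or None for the two edges at the root."""
    return k // 2 - 1 if k >= 2 else None


def is_non_descending(level_order):
    """True if no edge is lighter than the edge directly above it."""
    return all(level_order[parent_edge(k)] <= level_order[k]
               for k in range(2, len(level_order)))
```

Since the edge above edge k always has a smaller index than k, placing the weights in ascending order guarantees that every edge is at least as heavy as the edge above it.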



Figure 3: The linear tree $L$ with internal edges $e_1, e_2, \ldots, e_{n-3}$.

Then a variant of merge-sort is used to transform the order of internal edges of $L_{T_1}$ into the
ordering of $L_{T_2}$. To transform the auxiliary tree $T_1'$ into $T_2'$, it remains to sort the order of leaves
to complete the transformation from $T_1$ into $T_2$. Algorithm 1 gives a pseudo-code description of
DasGupta's algorithm.



Algorithm 1: DasGupta's_Sequential_Algorithm

Input: Rooted phylogenetic trees $T_1, T_2$.
Output: nni-distance $d_{\mathrm{nni}}(T_1, T_2)$ and a sequence $\mathcal{N}$ of nni-operations transforming $T_1$
        into $T_2$.

begin
    for i = 1, 2 do
        Construct auxiliary tree $T_i'$;
        /* generate nni-sequence $\mathcal{N}_i$ to transform $T_i$ into $T_i'$ */
        Generate sequence $(t_{i,1}, \ldots, t_{i,j(i)})$ that transforms $T_i$ into a linear tree $L_i$;
        Generate sequence $(a_{i,1}, \ldots, a_{i,k(i)})$ that transforms $T_i'$ into a linear tree $L_i'$;
        Generate merge-sort-sequence $(s_{i,1}, \ldots, s_{i,l(i)})$ that transforms $L_i$ into $L_i'$;
        $\mathcal{N}_i := (t_{i,1}, \ldots, t_{i,j(i)}, s_{i,1}, \ldots, s_{i,l(i)}, a_{i,k(i)}, \ldots, a_{i,1})$;
        /* note that sequence $(a_{i,1}, \ldots, a_{i,k(i)})$ is reversed in order to allow
           back-transformation to $T_i'$ */
    Generate sequence $(b_1, \ldots, b_m)$ to transform $T_1'$ into $T_2'$;
    $\mathcal{N} := \mathcal{N}_1 \circ (b_1, \ldots, b_m) \circ \mathcal{N}_2^{-1}$;
    /* note that sequence $\mathcal{N}_2$ is reversed for back-transformation to $T_2$ */



In case there exist good edge-pairs, these pairs yield a decomposition of $T_1, T_2$ into subtrees
and Algorithm 1 is applied to each pair of associated subtrees from $T_1$ and $T_2$. The parallel
computation of good and bad edges will be treated in Section 3.4. In the following, let us assume
that there exists no good edge-pair between $T_1$ and $T_2$.



3 Parallel Computation of the nni-Distance 

In this section we construct efficient parallel algorithms for the three steps of DasGupta's algo- 
rithm in the CRCW-PRAM-model. We start with a definition for the classification of internal 
nodes. 

When $T$ is a 3-regular phylogeny (i.e., each internal node has degree 3 in $T$), the internal
nodes of $T$ can be classified with respect to the number of adjacent leaves.

Definition 5. Let $T = (V, E, \lambda, \mathrm{wt})$ be a 3-regular phylogeny. Let $\mathcal{L}$ be the set of leaves in $T$.
An internal node $v \in \mathcal{I} = (V \setminus \mathcal{L})$ is called

• an endnode ($v \in V_{end}$), if it is adjacent to two leaves and one internal node,

• a pathnode ($v \in V_{path}$), if it is adjacent to one leaf and two internal nodes,

• a junction-node ($v \in V_{junc}$), if it is adjacent to three internal nodes in $T$.

This notation will be used in the course of the linearization step 2 of the sequential algorithm.

3.1 Linearizing Trees

In the first algorithmic step, both $T_1, T_2$ and their associated auxiliary trees $T_1', T_2'$ are transformed
into linear trees $L_1, L_2, L_1', L_2'$ respectively (cf. Figure 3). Let us first give an outline of
our parallel linearization procedure, which consists of three phases:

1. Activation-Phase: We proceed in a bottom-up manner at the boundary of the tree, i.e. at
endnodes $v \in V_{end}$ as defined above. At every endnode $v$ a process is started that builds the
path to the next junction-node $u \in V_{junc}$ and activates $u$ to prepare the junction-node for
insertion of the path from $v$.

If a junction-node $u$ is activated by more than one endnode in the activation phase, among
the two paths meeting at $u$ we select the one of smaller weight for insertion. Let this path
consist of $k$ internal edges $e_1, \ldots, e_k$, where $e_1$ is incident to $u$.

2. Insertion-Phase: We generate the sequence of nni-operations that is used for the insertion of
the selected path at the junction-node $u$. This yields a sequence of nni-operations of length
$k$, the length of the path to be inserted. The internal edges $e_1, \ldots, e_k$ are the operating
edges of these nni-moves.

3. Update-Phase: In the last phase the tree topology and the pointers inside the tree are
updated.

These three phases are repeated until the trees $T_1, T_2, T_1', T_2'$ are transformed into linear trees
$L_1, L_2, L_1', L_2'$, respectively.

Generating the Endnode-Paths for Insertion. Algorithm 2 computes for every node $v$ the
distance dist($v$), edge-list path($v$), length length($v$) and the head head($v$) of the path to the next
junction- or endnode next($v$) heading towards the root $r$. These values are computed efficiently in
parallel via pointer jumping in $O(\log n)$ time on $n$ processors.

(a) situation at junction-node $u$
(b) after insertion of path($v_k$)

Figure 4: Insertion of path($v_k$) from endnode $v_k$ adjoining junction-node $u$ = next($v_k$).
Algorithm 2: Endnode_Paths

Input: Phylogeny $T$ with root $r$ and pointer parent($v$) for all $v$ in $T$ and sets of junction-
       and endnodes $V_{junc}$ and $V_{end}$.
Output: For every node $v$ in $T$ the values dist($v$), path($v$), length($v$), next($v$) and head($v$).

begin
    foreach $v \in V$ parallel do
        dist($v$) := wt($e_v$);   /* initialize with parent edge $e_v$ = ($v$, parent($v$)) */
        path($v$) := $e_v$;
        head($v$) := $v$;
        length($v$) := 1;
        next($v$) := parent($v$);
        while next($v$) $\notin V_{junc} \cup V_{end}$ do
            dist($v$) := dist($v$) + dist(next($v$));
            path($v$) := path($v$) $\circ$ path(next($v$));
            head($v$) := next($v$);
            length($v$) := length($v$) + length(next($v$));
            next($v$) := next(next($v$));   /* Pointer-Jumping */
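The pointer-jumping loop of Algorithm 2 can be simulated sequentially to see why a logarithmic number of rounds suffices. The sketch below is a simplified illustration under stated assumptions: it tracks only dist, length and next (not path and head), treats the root as a stop node, and emulates the synchronous PRAM step by reading from a snapshot of the previous round. The function name and the dictionary-based tree representation are choices of this example.

```python
def endnode_paths(parent, weight, stop):
    """Sequential simulation of synchronous pointer jumping.

    parent: dict node -> parent node (the root maps to None).
    weight: dict node -> weight of the edge (node, parent[node]).
    stop:   set of junction- and endnodes; the root is added as a stop node.
    Returns (dist, length, nxt, rounds) for all non-root nodes.
    """
    root = next(v for v in parent if parent[v] is None)
    nodes = [v for v in parent if parent[v] is not None]
    halt = set(stop) | {root}
    dist = {v: weight[v] for v in nodes}
    length = {v: 1 for v in nodes}
    nxt = {v: parent[v] for v in nodes}
    rounds = 0
    while True:
        active = [v for v in nodes if nxt[v] not in halt]
        if not active:
            break
        # snapshot: every "processor" reads the values of the previous round
        d, l, n = dict(dist), dict(length), dict(nxt)
        for v in active:
            dist[v] = d[v] + d[n[v]]
            length[v] = l[v] + l[n[v]]
            nxt[v] = n[n[v]]
        rounds += 1
    return dist, length, nxt, rounds
```

In each round the path stored at an active node at least doubles in length or reaches a stop node, which is the source of the logarithmic bound on the number of rounds.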



Parallel Linearization of Trees. We are now ready to formulate Algorithm 3 for the linearization
of a tree $T$. Figure 4 illustrates the notation used in Algorithms 2 and 3 and shows
the result of an insertion process.



Algorithm 3: Parallel_Linear_Tree

Input: A phylogeny $T$ with root $r$.
Output: A list $\mathcal{N}$ of nni-operations which transforms $T$ into $L_T$.

while $\exists u \in V_{junc}$ do
    Endnode_Paths($T$);   /* re-generate paths and pointers */
    foreach $v_k \in V_{end}$ parallel do
        $u$ := next($v_k$);
        a($u$) := $v_k$;   /* activate $u$ from $v_k$, $k$ = length($v_k$) */
    foreach active $u \in V_{junc}$ parallel do
        $x$ := (sib($u$) $\ne$ head(a($u$)));
        foreach $1 \le i \le k$ parallel do
            $\mathcal{N}_u[i]$ := ((leaf($v_i$), $v_i$), $e_i$, $e_x$);   /* generate nni-triplets for every
                                                     operating edge $e_i$ on the path to $v_k$ */
        $\mathcal{N}$ := $\mathcal{N} \circ \mathcal{N}_u$;   /* concatenate list of nni's */
        parent($x$) := $v_k$;   /* insertion of the path at $x$ */
        wt(($x$, $v_k$)) := wt(($x$, $u$));
        $V_{junc}$ := $V_{junc} \setminus \{u\}$;   /* deletion of $u$ from the set of junction-nodes */



Lemma 2. Algorithm 3 transforms a given phylogeny $T$ into a linear tree $L_T$ in $O(\log n)$ time
on $n$ processors.



Proof. In every iteration endnode-paths are newly generated in time $O(\log n)$ on $n$ processors.
Then junction-nodes are activated and paths are inserted in parallel for every active junction-node,
i.e. for every active endnode, in constant time using $n$ processors.

Now let $|V_{end}| = l_i$ be the initial number of endnodes in $T$ in iteration $i$ of the linearization
step. Every endnode $v \in V_{end}$ tries to activate the next junction-node next($v$) towards the
root of $T$. This will be successful for at least every second endnode, since one junction-node is
shared by at most two endnodes. Therefore at least $l_i/2$ insertions of an endnode-path path($v$) are
carried out at next($v$) in each iteration and the number of end- and junction-nodes is reduced
by at least $l_i/2$ in iteration $i$. Thus the number of iterations is bounded by $\lfloor \log n \rfloor$. □

3.2 Sorting Edge-Permutations on Linear Trees 

This phase refers to step 3 of the sequential algorithm. We start with two linear trees $L_1$
and $L_1'$ associated to the original tree $T_1$ and the balanced tree $T_1'$ with presorted edges. Now
the sequence of nni-operations is generated that transforms the sequence $e_1', \ldots, e_{n-3}'$ of
internal edges in $L_1$ into the linearized sorted sequence, say $e_1'', \ldots, e_{n-3}''$, of $L_1'$.

The general approach of the sequential algorithm is first to transform adjacent edge-pairs
by nni-moves, such that afterwards the whole sequence is pairwise alternating from ascending
to descending according to the sorting order of $e_1'', e_2'', \ldots, e_{n-3}''$ (the ascending and descending
subsequences of edges will be called blocks). Then, starting from the middle, we merge and pull
out adjacent blocks via nni-operations, finally resulting in a linear tree of blocks of doubled size,
again alternating. At the $k$-th stage, we begin with $\frac{n}{2^k}$ blocks of $2^k$ internal edges each, resulting
in blocks consisting of $2 \cdot 2^k$ edges. See Figure 5 for an illustration. The sorting algorithm
terminates when the resulting sequence consists of only one block, containing all edges.



(a) initially unsorted tree $L^1$ with $|B_i| = 1$, $i = 8$
(b) $L^2$ with pairwise alternating edge-weights
(c) $L^3$ after first merging-stage: $|B_i| = 2$, $i = 4$
(d) $L^4$ after second merging-stage: $|B_i| = 4$, $i = 1$

Figure 5: Sorting edges on a linear tree $L$ via merging and pulling out alternating sequences of
edge-weights $B_i$. Note that the length $|B_i|$ of the sorted sequences doubles in every merging-stage.



Parallel Tree Merging. Now we describe an efficient parallel algorithm for sorting the edge
permutations. We will not only consider the two adjacent blocks in the middle for comparing
and merging, but all the $\frac{n}{2^{k+1}}$ block-pairs that will be adjacent in the course of stage $k$ in parallel.
So we have to describe the pairing of blocks and edges inside blocks for each stage in order to
allow for parallel computation.

At stage $k$ let $B_1, B_2, \ldots, B_{n/2^k}$ be the blocks appearing in that order on the linear tree. We
start pairing recursively from the middle, such that $B_l$ pairs with $B_{n/2^k - l + 1}$ for $l \in \{1, \ldots, \frac{n}{2^{k+1}}\}$.

Furthermore, let $e_{(l-1)2^k+1}, e_{(l-1)2^k+2}, \ldots, e_{l 2^k}$ be the edges of block $B_l$ at stage $k$.

To preserve simplicity, we illustrate the merging of edges of two blocks within a pair $(B_x^k, B_y^k)$,
which is said to be a block-pair to get adjacent and to be merged at stage $k$ within the linear
tree $L^k$. Let $e_i^k$ denote the edge at position $i$ in $L^k$. The new position of this edge within $L^{k+1}$
is denoted by $e_{i+p}^{k+1}$, where $p$ is the rank (regarding its edge-weight compared and ranked with
the edge-weights of the opposite block) of $e_i^k$ in the opposite block of the merging-stage plus the
number of equally ranked edges positioned before $e_i^k$ within the same block.

So if $e_i^k \in B_x^k$ is at position $i$ in $L^k$, we have
$p = \mathrm{rank}(e_i^k \mid B_y^k) + |\{e_j^k \in B_x^k \mid j < i, \mathrm{rank}(e_j^k) = \mathrm{rank}(e_i^k)\}|$
and the position changes from $e_i^k$ to $e_{i+p}^{k+1}$ in $L^{k+1}$, as shown in Figure 6. We compute
the ranking and positioning for all internal edges of the block-pair $(B_x^k, B_y^k)$ in parallel.
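The idea of merging by ranking, where every element determines its target position independently of all others, can be sketched as follows. This is the standard rank-based merge of two ascending lists, used here as a stand-in under simplifying assumptions: both blocks are taken to be ascending, ties are resolved in favour of the first block, and the nni-triplets are not generated. The function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def merge_by_ranking(bx, by):
    """Merge two ascending lists; each target slot is computed independently."""
    out = [None] * (len(bx) + len(by))
    # Conceptually one processor per element: own position plus rank in the other block.
    for i, w in enumerate(bx):
        out[i + bisect_left(by, w)] = w       # ties: elements of bx go first
    for j, w in enumerate(by):
        out[j + bisect_right(bx, w)] = w
    return out
```

Because no target slot depends on the result of another element, the two loops could be executed by independent processors, which is the property the parallel merging stage relies on.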



Figure 6: Ranking edges within a block-pair $(B_x^k, B_y^k)$ on the linear tree $L^k$, resulting in $L^{k+1}$ with
doubled block-size at the combined block $B_{xy}^{k+1}$.

Furthermore, this sorting procedure is performed in parallel for all block-pairs which get
adjacent in stage $k$ on $L^k$ with a total number of $\frac{n}{2^k} \cdot 2^k = n$ processors running in $O(1)$ time. In
order to compute the sequence of nni-operations needed for the transformation $L^k \to L^{k+1}$,
we look at both the block-pair $(B_x^k, B_y^k)$ and the combined block $B_{xy}^{k+1}$. The sequence of internal
edges of $B_{xy}^{k+1}$ yields the sequence of operating edges. To complete the nni-triplet,
we find in parallel for every edge $e_i^{k+1}$ the next edge from the opposite block with respect to
the situation on $L^k$ appearing in the sequence. The nni-triplet is generated via the edge of the
'outer' leaf of $e_i^{k+1}$ and the first $e_j^{k+1}$ from the opposite block.

The actual pairing situation if the nni-operations would be performed sequentially on $L^k$ is
shown in Figure 7.

Figure 7: Merging edges of a block-pair $(B_x^k, B_y^k)$ via nni-operations.

After pairing up every edge in the sequence we have:

• if $\mathrm{wt}(e_i^k) < \mathrm{wt}(e_j^k)$ and $B_x \nearrow\searrow B_y$ holds, the next nni-operation is $\mathrm{nni}(e_j^k, e_i^k, l_i^k)$

• if $\mathrm{wt}(e_i^k) > \mathrm{wt}(e_j^k)$ and $B_x \searrow\nearrow B_y$ holds, the next nni-operation is $\mathrm{nni}(l_i^k, e_i^k, e_j^k)$
We are now ready to state Algorithm 4 to compute the sequence of nni-operations used for
merge-sorting a linear tree $L$.



Algorithm 4: Tree_Merge_Sort

Input: Linear tree $L$, permutation $e_1, e_2, \ldots, e_{n-3}$ of internal edges of $L$.
Output: Sequence $\mathcal{N}$ of nni-operations that transforms $L$ into $L'$ with internal edges
        sorted.

for k = 1 to log n do
    foreach $l \in \{1, \ldots, \frac{n}{2^{k+1}}\}$ parallel do
        $B_x := B_l$, $B_y := B_{n/2^k - l + 1}$;
        $B_{xy}$ := merge($B_x$, $B_y$);   /* Merging two blocks via ranking edges */
        $L^k := B_{xy} \circ L^k$;   /* at the end of the foreach-phase in the k-th
                                 iteration, $L^k = e_1^k, \ldots, e_{n-3}^k$ */
    foreach $e_i^k \in L^k$ parallel do
        $e_j^k$ := next edge from opposite block;
        if $\mathrm{wt}(e_i^k) < \mathrm{wt}(e_j^k)$ and $B_x \nearrow\searrow B_y$ then
            nni($i$) := $(e_j^k, e_i^k, l_i^k)$;   /* as illustrated in Figure 7 */
        else if $\mathrm{wt}(e_i^k) > \mathrm{wt}(e_j^k)$ and $B_x \searrow\nearrow B_y$ then
            nni($i$) := $(l_i^k, e_i^k, e_j^k)$;
        $\mathcal{N} := \mathcal{N} \circ$ nni($i$);



We obtain the following Lemma: 

Lemma 3. The sorting of edge-permutations is performed in O(logn) time on n processors. 

Proof. In Algorithm 4 the length $|B_l|$ of the sorted sub-sequences doubles with every merging-stage.
Therefore at most $\log n$ complete merging-rounds are needed to yield a sorted sequence
of length $n$. In the merging-steps of stage $k$, we have $\frac{n}{2^k}$ blocks of length $2^k$ which are compared
and merged to blocks of doubled size using $\frac{n}{2^k} \cdot 2^k = n$ comparisons, i.e. allocating $n$ processors
and yielding a running time of $O(1)$. The subsequent generation of the nni-triplets also uses $n$
processors for $O(1)$ time per stage. □

3.3 Sorting Leaf-Permutations on Balanced Binary Trees 

This phase refers to step 4 of the sequential algorithm. We are given two binary balanced trees
$T_1', T_2'$ that only differ in the ordering of leaves. The sequential algorithm generates a sequence
$\mathcal{N}$ of nni-operations which implement the cycles of the permutation of leaves transforming $T_1'$
into $T_2'$. We show how to generate this sequence efficiently in parallel.

Let $d$ be the depth of $T_1'$ and $T_2'$. When $T_1'$ is transformed into $T_2'$ by use of the sequence $\mathcal{N}$,
the corresponding intermediate trees might be unbalanced. More precisely, let
$\pi : \{1, \ldots, n\} \to \{1, \ldots, n\}$ be the permutation transforming the order of leaves $l_1, \ldots, l_n$ in $T_1'$
into $l_{\pi(1)}, \ldots, l_{\pi(n)}$ in $T_2'$. Let $\pi$ consist of cycles $C_1, \ldots, C_k$. Then
$\mathcal{N} = \mathcal{N}_1 \circ \cdots \circ \mathcal{N}_k$ where $\mathcal{N}_i$ implements cycle
$C_i$. Let $C_i = (c_{i,1}, \ldots, c_{i,q})$ be one cycle, then $\mathcal{N}_i = \mathcal{N}_{i,1} \circ \cdots \circ \mathcal{N}_{i,q}$, where $\mathcal{N}_{i,j}$ is a sequence of
nni-operations which transports the leaf $l_{c_{i,j}}$ to its new position in $T_2'$ (cf. Figure 8).
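The decomposition of the leaf permutation into cycles, on which the construction above is based, can be sketched in a few lines. The code below is a plain sequential illustration with 0-based positions; the function name is an assumption of this example, and fixed points are omitted because leaves that already sit at their target position need no transport.

```python
def cycles(perm):
    """Non-trivial cycles of a permutation given as a list with perm[i] = image of i."""
    seen, result = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, i = [], start
        # follow the permutation until the cycle closes
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = perm[i]
        if len(cycle) > 1:
            result.append(tuple(cycle))
    return result
```

For example, `cycles([2, 1, 0, 4, 5, 3])` returns `[(0, 2), (3, 4, 5)]`; each returned tuple corresponds to one cycle whose leaves are transported one after another.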
Figure 8: Transportation of leaf $l_{c_{i,j}}$ at position $s$ to its target position $t$ with leaf $l_{c_{i,j+1}}$ attached.

Let $T_{\mathcal{H}}$ denote the tree that results from applying a sequence $\mathcal{H}$ of nni-operations to the tree
$T_1'$. For each prefix $\mathcal{H}$ of $\mathcal{N}$, the tree $T_{\mathcal{H}}$ has depth $d$ or $d+1$, hence the set of possible positions
of edges in $T_{\mathcal{H}}$ is $P = \{(l, j) \mid 1 \le l \le d+1, 1 \le j \le 2^l\}$. Then each of the trees
$T_{\mathcal{N}_1 \circ \cdots \circ \mathcal{N}_j}$ differs
from $T_1'$ only w.r.t. the order (positions) of leaves, i.e. all the internal edges have the same
position as in $T_1'$. Furthermore each
$T_{\mathcal{N}_1 \circ \cdots \circ \mathcal{N}_j \circ \mathcal{N}_{j+1,1} \circ \cdots \circ \mathcal{N}_{j+1,h}}$ is one of the imbalanced trees $T_{s,t}$
of depth $d+1$ with $s, t \in P$ positions of depth $d-1$ and $d+1$ respectively (cf. Figure 9).




[Figure 9: image not recovered; axis label: depth]



The positions of internal edges in $T_{s,t}$ only depend on $s$ and $t$: if internal edge $e$ has position
$(l, p)$ in $T_1'$, then its position in $T_{s,t}$ is one of $(l, p)$, $(l-1, \lfloor p/2 \rfloor)$, $(l+1, 2p)$, $(l+1, 2p+1)$, depending
on whether the edge $e$ is on the path from $s$ to $t$ and whether it is on the ascending or descending part of
this path. Hence for each prefix $\mathcal{H}$ of $\mathcal{N}$ of the form
$\mathcal{H} = \mathcal{N}_1 \circ \cdots \circ \mathcal{N}_j \circ \mathcal{N}_{j+1,1} \circ \cdots \circ \mathcal{N}_{j+1,h}$ the
positions of edges $p_{\mathcal{H}} : E \to P$ in the tree $T_{\mathcal{H}}$ which results from $T_1'$ by application of $\mathcal{H}$ can be
computed efficiently in parallel.

Lemma 4. The sorting of leaf-permutations on two binary balanced trees can be done in time 
O(logn) on n processors. 

Proof. Since the height of balanced binary trees is bounded by $\lceil \log n \rceil$, the positions of edges
$p_{\mathcal{H}} : E \to P$ in $T_{\mathcal{H}}$ for a prefix $\mathcal{H}$ of $\mathcal{N}$ can be efficiently computed in time $O(\log n)$ on $n$
processors. Thus it remains to describe how to compute the sequence $\mathcal{N}_{j+1,h+1}$ for a given
$\mathcal{H} = \mathcal{N}_1 \circ \cdots \circ \mathcal{N}_j \circ \mathcal{N}_{j+1,1} \circ \cdots \circ \mathcal{N}_{j+1,h}$ and $p_{\mathcal{H}}$ as above:
Let $C_{j+1} = (c_{j+1,1}, \ldots, c_{j+1,h+1}, c_{j+1,h+2}, \ldots)$ be the $(j+1)$-th cycle of $\pi$ and let $T_{\mathcal{H}} = T_{r,s}$
and $T_{\mathcal{H} \circ \mathcal{N}_{j+1,h+1}} = T_{r,t}$ with $r = (d-1, \lfloor c_{j+1,1}/2 \rfloor)$. If $s = (d+1, x)$ and $t = (d+1, y)$ with
$x = 2^{d'} \cdot a - j_x$ and $y = 2^{d'} \cdot a - j_y$ with $j_x, j_y \in \{0, \ldots, 2^{d'} - 1\}$ then the lowest common ancestor
is at position $(d', a)$ (see Figure 10) and $\mathcal{N}_{j+1,h+1}$ is a sequence of $2(d - d') - 1$ nni-operations.
Finally, $\mathcal{N}_{j+1,h+1}$ can be constructed in $O(\log n)$ time on a single processor since $d' < d < \log n$ and
the positions of edges in the tree $T_{\mathcal{H}}$ are known at that point. □
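The role of the lowest common ancestor in the proof can be made concrete with a small sketch that works on positions (level, index) in a complete binary tree. The position convention used here (root at (0, 1), indices 1-based within each level, parent index (j + 1) // 2) is an assumption of this example and may differ from the indexing used in the proof; the function name is hypothetical as well.

```python
def lowest_common_ancestor(s, t):
    """Lowest common ancestor of two positions (level, index) in a complete binary tree.

    Convention (assumed for this sketch): root = (0, 1), indices are 1-based
    within each level, and the parent of (l, j) is (l - 1, (j + 1) // 2).
    """
    (ls, xs), (lt, xt) = s, t
    # lift the deeper position until both are on the same level
    while ls > lt:
        ls, xs = ls - 1, (xs + 1) // 2
    while lt > ls:
        lt, xt = lt - 1, (xt + 1) // 2
    # lift both positions in lockstep until they meet
    while xs != xt:
        ls, xs, xt = ls - 1, (xs + 1) // 2, (xt + 1) // 2
    return ls, xs
```

Each loop iteration moves one level up, so the number of steps is bounded by the depth of the tree, in line with the logarithmic bound used above for constructing a single transport sequence.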



Figure 10: Transportation-path between $s$ and $t$ via the lowest common ancestor in $T_{r,s}$.

This completes the last step of our parallel algorithm for approximating the nni-distance
between two weighted phylogenies and we get the following theorem as a corollary of Lemmas 2,
3 and 4.

Theorem 2. The nni-distance between two phylogenies $T_1$ and $T_2$ and the sequence of nni-operations
can be approximated within approximation ratio $O(\log n)$ in $O(\log n)$ time on $n$
processors.



In the last section, we present a parallel algorithm to compute good edge-pairs in order to be 
able to split up large problem instances in a pre-processing step and to identify edges, for which 
no nni-operation is needed in order to transform the trees into each other. 



3.4 Detecting Good Edge-Pairs 

Our aim is to identify good edge-pairs $(e_x, e_y)$¹ with $\mathrm{wt}(e_x) = \mathrm{wt}(e_y)$, $e_x \in E_{T_1}$ and $e_y \in E_{T_2}$,
which induce the same partition on the set of leaf-labels and edge-weights in their corresponding
tree (cf. Definition 4).

In [DHJ+00] this computational step is performed in $O(n^2)$ time, which dominates the total
running time of the original algorithm. In [HKL00, HKL+04] Hon et al. give an improved
algorithm for computing good edge-pairs, whose running time is $O(n \log n)$. In the following, we
adopt the approach of Hon et al. and design an efficient parallel algorithm running in $O(\log n)$
time on $O(n \log n)$ processors. Let us first give an outline of the approach of Hon et al.



¹ Not necessarily having x = y in similarly labeled edge-sets, with the set of edge-weights being a multiset.

Partition-Labeling Problem. In [HKL+04] Hon et al. define a problem called the partition-labeling
problem on two rooted trees and present a solution running in $O(n \log n)$ time. Then,
the problem of computing good edge-pairs between two weighted phylogenies is reduced to this
problem in time $O(n \log n)$. Therefore, the time complexity of the original algorithm is improved
from $O(n^2)$ to $O(n \log n)$. A partition-labeling between two rooted trees is defined as follows:

Let $R$ and $R'$ be two rooted trees with leaves labeled by the same multi-set $S$ of leaf labels.
Let $A$ be any subset of $\mathcal{S}(S)$, where $\mathcal{S}(S)$ is the set of distinct symbols or labels in $S$. For each
internal node $u \in V(R)$, $L_R(u)$ is defined as the multi-set of leaf labels in the subtree of $R$
rooted at $u$, and $L_R(u)|_A$ as the restriction of $L_R(u)$ to $A$. Given $R$ and $R'$, let $V$ and $V'$
be the sets of internal nodes in $R$ and $R'$, respectively. A pair of mappings $\rho : V \to [1, \ell]$ and
$\rho' : V' \to [1, \ell]$, $\ell = |V| + |V'|$, is called a partition-labeling for $R$ and $R'$, if for all $u \in V$ and
$v \in V'$, $\rho(u) = \rho'(v)$ if and only if $L_R(u) = L_{R'}(v)$.
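To make the definition concrete, the following sketch computes a partition-labeling in the straightforward way, by materializing the leaf multi-set of every internal node and assigning one integer per distinct multi-set. It is a naive baseline for illustration only, not the incremental method of Hon et al.; the nested-tuple tree encoding and both function names are assumptions of this example.

```python
def leaf_multisets(tree):
    """tree: nested tuples, where a leaf is a string label.

    Returns a list of (internal node, sorted tuple of leaf labels below it)
    in post-order.
    """
    out = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        labels = tuple(sorted(l for child in node for l in visit(child)))
        out.append((node, labels))
        return labels

    visit(tree)
    return out


def naive_partition_labeling(r1, r2):
    """Label internal nodes so that equal labels mean equal leaf multi-sets."""
    ids = {}

    def label(tree):
        # one fresh integer per distinct multi-set, shared between both trees
        return [ids.setdefault(ms, len(ids) + 1) for _, ms in leaf_multisets(tree)]

    return label(r1), label(r2)
```

For the trees `(("a", "b"), ("c", "a"))` and `(("a", "c"), ("b", "a"))` the result is `([1, 2, 3], [2, 1, 3])`: the two roots share a label, and the two cherries are matched crosswise. Since at most one label per internal node is created, all labels lie in the range from 1 to the total number of internal nodes of both trees.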

The partition-labeling problem is to find a partition-labeling $(\rho, \rho')$ for $R$ and $R'$. A straightforward
approach is to compute all multi-sets $L_R(u)$ and $L_{R'}(v)$, but this, similar to the approach
of DasGupta et al., also takes $O(n^2)$ time. In order to reduce the time complexity, Hon et
al. compute the multi-sets in an incremental manner and compare them based on earlier partial
results. For this purpose, let $R_A$ be the contracted subtree of $R$ induced by $A$, containing only
leaves with labels in $A$ and their lowest common ancestors. Algorithm 5 shows the framework
of the method described by Hon et al. in [HKL+04].



Algorithm 5: Partition_Labeling

Input: Two rooted trees R, R' with leaves labeled by the same multi-set S.
Output: Partition-labeling (p, p') for R, R'.

foreach A_i ∈ {A_1, A_2, ..., A_{|δ(S)|}} do
    Compute partition-labeling for R_{A_i} and R'_{A_i};
for k = 1 to log n do
    Let A_1, A_2, ... be the label sets considered in the last round;
    Pair up the A_j's such that A_{2j-1} = A_{2j-1} ∪ A_{2j};
    Delete all A_{2j}'s and set A_{2j-1} =: A_j;
    foreach A_j do
        Compute partition-labeling for R_{A_j} and R'_{A_j} based on the result of the last round;
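The pairing schedule of Algorithm 5 can be sketched as follows. This is a minimal illustration of how the label sets are merged round by round, assuming each initial set holds a single label; the function name `merge_rounds` is hypothetical and the sketch does not compute any labeling.

```python
def merge_rounds(label_sets):
    """Return the list of label-set collections, one entry per round.

    Entry 0 holds the initial sets; each later entry is obtained by
    uniting neighbouring pairs of the previous entry. An unpaired last
    set is carried over unchanged.
    """
    current = list(label_sets)
    rounds = [current]
    while len(current) > 1:
        current = [
            current[i] | current[i + 1] if i + 1 < len(current) else current[i]
            for i in range(0, len(current), 2)
        ]
        rounds.append(current)
    return rounds

# Five singleton label sets need three merging rounds (5 -> 3 -> 2 -> 1).
schedule = merge_rounds([frozenset(c) for c in "abcde"])
```

Since the number of sets is roughly halved in every round, the number of merging rounds is the base-2 logarithm of the number of distinct labels, rounded up.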



We will now show how the first foreach-phase can be efficiently computed in parallel. 

Lemma 5. The induced subtree R_A can be computed in time O(log t) on O(t log t) processors.

Proof. Using the algorithm of Schieber and Vishkin [SV88], with preprocessing in time O(log n)
on O(n log n) processors, we can answer lowest common ancestor queries for a pair of nodes in
R in time O(1). Furthermore, we use the Euler-Tour Technique (ETT) of Tarjan and Vishkin
[TV84] to compute the postorder and preorder numberings of nodes in R in O(log n) time on
O(n) processors. We construct R_A as follows, given the fact that all trees under consideration
are 3-regular.

Let ℓ_1, ℓ_2, ..., ℓ_t be the sequence of leaves of R with labels in A, ordered from left to
right by the preorder numbers pre(ℓ_i). We perform lowest common ancestor queries for each
pair (ℓ_i, ℓ_{i+1}) of the leaf sequence and obtain the set of internal nodes w_1, w_2, ... of R_A,
i.e. LCA(ℓ_i, ℓ_{i+1}) = w_i ∈ V(R_A) for all 1 ≤ i < t.^2 Now let pre(R_A) and post(R_A) be
the (partial) preorder and postorder sequences of nodes in R restricted to the leaves and internal
nodes of R_A. In order to reconstruct the (contracted) topology of R_A, we take both sequences
and generate the parental pointers parent(v), v ∈ R_A, in parallel as follows:

For every internal node w of R_A with pre(w) = x and post(w) = y we look at position x + 1
in pre(R_A) and position y − 1 in post(R_A) to find the left-hand child and right-hand child of w,
respectively. This can be done in parallel for every internal node w_i of R_A in O(1) time on O(t)
processors. This completes the construction of the induced subtree R_A (in time O(1) on O(t)
processors, with pre-processing in amortized O(log t) time on O(t log t) processors). □

^2 Note that all internal nodes of R_A are found in this way, since for every internal node w of R_A
the right-most leaf u of the left subtree below w is neighboring the left-most leaf v of the right
subtree below w in terms of the preorder sequence of leaves, and LCA(u, v) = w.



By Lemma 2.2 in [HKL+04], we have the following fact: Let A and B be two disjoint subsets
of δ(S) and let u be an internal node in R_{A∪B}. Then, L_{R_{A∪B}}(u)|A = ∅ or L_{R_A}(v) for some
v ∈ R_A and similarly, L_{R_{A∪B}}(u)|B = ∅ or L_{R_B}(v) for some v ∈ R_B.

The next lemma implies that the first foreach-phase of Algorithm 5 can be completed in
O(log n) time on O(n log n) processors.

Lemma 6. Let a ∈ δ(S). A partition-labeling for R_{a} and R'_{a} can be found in O(log t) time
on O(t) processors, where t is the number of leaves in R with label a.

Proof. Perform a postorder numbering on R_{a} in O(log t) time on O(t) processors (cf. [TV84]).
Since L_{R_{a}}(u) only contains multiple copies of a, we only need to keep track of |L_{R_{a}}(u)|, i.e.
the number of leaves of the subtree below u. The number of descendant leaves for each internal
vertex u can be obtained from the prefix sum of the weights of edges determined in the postorder
numbering algorithm. Assign this number to u and apply the same procedure to R'_{a}. □
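The idea of Lemma 6 can be sketched sequentially: when all leaves carry the same label, the number of leaves below an internal node determines its multi-set, so that number can serve directly as the label. The sketch below uses a plain bottom-up pass instead of the parallel prefix sum; the function name `leaf_count_labels` and the tree encoding are assumptions.

```python
def leaf_count_labels(children, root):
    """Label every internal node with the number of leaves below it."""
    order, stack = [], [root]
    while stack:                     # parents are appended before their children
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    counts, labels = {}, {}
    for u in reversed(order):        # children are processed before their parent
        kids = children.get(u, [])
        counts[u] = sum(counts[c] for c in kids) if kids else 1
        if kids:
            labels[u] = counts[u]
    return labels

# Two single-label trees with three leaves each (hypothetical example data).
p = leaf_count_labels({"r": ["x", "l3"], "x": ["l1", "l2"]}, "r")
p_prime = leaf_count_labels({"s": ["m1", "y"], "y": ["m2", "m3"]}, "s")
```

Applying the same function to both trees gives equal labels exactly for nodes with equally many leaves below them, which is the partition-labeling condition in the single-label case.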

Now, we have the partition-labeling for R_{i} for every distinct label i ∈ δ(S). In the next
phase of Algorithm 5 the labels are paired together and the corresponding trees R_{i} and R_{j} for
i, j ∈ δ(S) are merged to form R_{i,j}. A partition-labeling is computed based on the
partition-labeling of the first round and this is repeated for log |δ(S)| rounds until a
partition-labeling for R_{δ(S)} is produced. Let us describe the relabeling of the internal nodes
of R_{A∪B} and R'_{A∪B} for two distinct subsets A, B ⊆ δ(S), given the corresponding
partition-labelings (p_A, p'_A) and (p_B, p'_B) for (R_A, R'_A) and (R_B, R'_B), respectively.

First, we consider R_{A∪B}. For each internal node u in R_{A∪B}, assign a 2-tuple (a, b) to u such
that a is set to the highest integer-label of L_{R_{A∪B}}(u)|A and b is set to the highest
integer-label of L_{R_{A∪B}}(u)|B. If u ∈ R_A we set a = p_A(u), and if u ∈ R_B we set b = p_B(u).
If L_{R_{A∪B}}(u)|A = ∅ we set a = 0, and if L_{R_{A∪B}}(u)|B = ∅ we set b = 0. It remains the case
where L_{R_{A∪B}}(u)|A ≠ ∅ and u ∉ R_A. Here, there exists a node v such that
L_{R_{A∪B}}(u)|A = L_{R_A}(v) and we set a = p_A(v), which is the highest label in L_{R_{A∪B}}(u)|A.
The case for L_{R_{A∪B}}(u)|B ≠ ∅ and u ∉ R_B is treated analogously, and R'_{A∪B} is treated in
the same way as R_{A∪B}. After we have determined the values in (a, b) for every internal node u
of R_{A∪B} and R'_{A∪B}, the 2-tuples are sorted and a new integer (starting from 1) is assigned to
every distinct 2-tuple. This integer is then assigned as a label to the internal node u and a
partition-labeling p_{A∪B} (p'_{A∪B}) for R_{A∪B} (R'_{A∪B}) is obtained.
Algorithm 6 shows how this can be done efficiently in parallel.
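The relabeling step can be simulated sequentially as a rough sketch. The first function imitates synchronous pointer-jumping rounds that forward the highest value towards the ancestors; the second ranks the resulting 2-tuples. The function names, the dictionary encoding, and the choice to rank the tuples of both trees jointly (so that labels are comparable across R_{A∪B} and R'_{A∪B}) are assumptions of this example. It also assumes that labels grow along ancestor chains, so that the maximum is the value to keep.

```python
from math import ceil, log2

def propagate_max(parent, init):
    """Simulate pointer-jumping: return the maximum init value in each subtree.

    parent[u] is the parent of u (None for the root); init[u] is 0 when u
    carries no label. All reads of one round use the values of the previous
    round, which mimics the synchronous parallel step.
    """
    val, ptr = dict(init), dict(parent)
    for _ in range(max(1, ceil(log2(len(parent))))):
        new_val = dict(val)
        for u, p in ptr.items():               # conceptually one processor per node
            if p is not None:
                new_val[p] = max(new_val[p], val[u])
        ptr = {u: (ptr[p] if p is not None else None) for u, p in ptr.items()}
        val = new_val
    return val

def rank_tuples(*tuple_maps):
    """Give every distinct 2-tuple a new integer label, starting from 1."""
    distinct = sorted({t for m in tuple_maps for t in m.values()})
    rank = {t: i + 1 for i, t in enumerate(distinct)}
    return [{u: rank[t] for u, t in m.items()} for m in tuple_maps]

# Node 0 is the root; nodes 2 and 3 carry labels from the previous round.
parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 0}
a = propagate_max(parent, {0: 0, 1: 0, 2: 5, 3: 2, 4: 0})
```

After k rounds every node has seen all values within distance 2^k − 1 below it, so a logarithmic number of rounds suffices even for a path-like tree.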

Algorithm 6: Parallel_Partition_Relabeling

Input: The tree R_{A∪B} with root r and t leaves, parental pointers parent(v), v ∈ R_{A∪B},
       and partition-labelings p_A and p_B for R_A and R_B.
Output: Partition-labeling p_{A∪B} for R_{A∪B}.

foreach u ∈ R_{A∪B} parallel do
    if u ∈ R_A then
        a(u) := p_A(u);    /* a(·) and b(·) are initialized with 0 */
    if u ∈ R_B then
        b(u) := p_B(u);
for k = 1 to log t do
    foreach u ∈ R_{A∪B} with parent(u) ≠ r parallel do
        a(parent(u)) := max{a(u), a(parent(u))};
        b(parent(u)) := max{b(u), b(parent(u))};
        parent(u) := parent(parent(u));    /* Pointer-Jumping */
PARALLEL_RADIX_SORT((a(u_1), b(u_1)), ..., (a(u_t), b(u_t)));    /* In O(log t) time on O(t) processors */
Let ((a(u_{i_1}), b(u_{i_1})), ..., (a(u_{i_t}), b(u_{i_t}))) be the sorted sequence;
for j = 1, ..., t parallel do
    Determine left(j) and right(j), the positions of the leftmost and the rightmost node in the
    block of u_{i_j};    /* bidirectional pointer-jumping, see the proof of Lemma 7 */
foreach u_{i_j} with left(j) = j parallel do
    p_{A∪B}(u_{i_j}) := k where u_{i_j} is the k-th such node in the sorted order;
foreach j = 1, ..., t parallel do
    p_{A∪B}(u_{i_j}) := p_{A∪B}(u_{i_{left(j)}});

Let us now formulate the corresponding lemma and show that the labels assigned by
Algorithm 6 form a valid partition-labeling.

Lemma 7. Given the partition-labelings (p_A, p'_A) and (p_B, p'_B) for (R_A, R'_A) and (R_B, R'_B), we
can compute partition-labelings p_{A∪B} and p'_{A∪B} for R_{A∪B} and R'_{A∪B} in O(log t) time on O(t)
processors, where t is the number of leaves in R_{A∪B}.

Proof. In Algorithm 6 we perform a bottom-up pointer-jumping on the internal nodes of R_{A∪B}
and forward the values for both the labels in L_{R_{A∪B}}|A and in L_{R_{A∪B}}|B. At every internal node
u, during the O(log t) rounds, we only keep track of the highest value regarding the two label sets.
If u ∈ R_A, then p_A(u) is the highest label of the set L_{R_{A∪B}}|A and we correctly set a = p_A(u).
Otherwise, if u ∉ R_A but there exists a child s of u in R_{A∪B} with L_{R_{A∪B}}(s)|A = L_{R_A}(t), then
p_A(t) is the highest label of the set L_{R_{A∪B}}|A and we set a = p_A(t). If no such child exists,
then L_{R_{A∪B}}(u)|A = ∅ and we keep the initial value a = 0. Similarly, the values for b are set by
Algorithm 6 according to L_{R_{A∪B}}|B.

After the relabeling process, we have L_{R_{A∪B}}(u) = L_{R_{A∪B}}(u)|A ∪ L_{R_{A∪B}}(u)|B and hence
L_{R_{A∪B}}(p) = L_{R'_{A∪B}}(q) if and only if the corresponding 2-tuples assigned to p and q are
identical. Therefore, the labels assigned to the nodes after performing PARALLEL_RADIX_SORT (cf.
[Ble90]) on the 2-tuples form a valid partition-labeling. PARALLEL_RADIX_SORT and the
pointer-jumping are performed in O(log t) time on O(t) processors and the initialization of
a(u), b(u) is done in O(1) parallel time. After the radix sort, we use bidirectional
pointer-jumping to find for each node u_{i_j} the leftmost and the rightmost node in the block of
u_{i_j}, consisting of all the nodes u_{i_k} which have the same pair of labels as u_{i_j}. Then we first
assign new labels to the leftmost nodes of all blocks. This is done by performing a pointer-jumping
on these nodes, using the pointers left(j) and right(j). Finally, in O(1) time, we can also assign
these labels to the remaining nodes, again using the pointers left(j). Therefore, Algorithm 6
runs in O(log t) time on O(t) processors. □

By Lemmas 5, 6 and 7, the overall complexity for the partition-labeling problem is O(log n) time
on O(n log n) processors. Next, we show how the problem of identifying good edge-pairs between
phylogenies T_1, T_2 is reduced to the partition-labeling problem between two rooted trees R, R'.

Partition-Labeling and Good Edge-Pairs. The reduction given by Hon et al. [HKL+04]
starts by setting R = T_1 and R' = T_2. Then an arbitrary leaf with label a is fixed, and R and R'
are rooted at the same internal node adjacent to the leaf with label a. Then each internal edge
e = (u, v) is replaced by a path u, s, v and a new leaf w, adjacent to s, whose label p(w) is unique
to the edge-weight wt(e). This means, for newly added leaves w_1, w_2 corresponding to edges
e_1, e_2, we have p(w_1) = p(w_2) if and only if wt(e_1) = wt(e_2). This completes the construction
of R and R'.

Lemma 8. The construction of R and R' takes O(log n) time on O(n) processors.

Proof. The rooting of R and R' at an arbitrary node takes O(log n) time on O(n) processors
using the Euler-Tour Technique and parallel prefix sum (cf. [TV84]). In order to generate the
labeled leaves that represent edge-weights, we temporarily assign for each edge e = (u, v) and
newly added leaf w the edge-weight wt(e) to w. Then, we sort the sequence of leaf labels of the
new leaves w_1, w_2, ..., w_{|E|} and assign labels such that p(w_i) = p(w_j) if and only
if wt(e_i) = wt(e_j). This can also be accomplished in O(log n) time on O(n) processors. □
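The construction of R and R' from already rooted trees can be sketched in Python as follows. The edge-list encoding `(parent, child, weight)`, the tuple names used for the new nodes, and both function names are assumptions of this example. Weights are ranked over both trees together, so that equal weights receive equal labels in R and R'.

```python
def weight_ranks(*edge_lists):
    """Map every edge-weight occurring in the given trees to a rank 1, 2, ..."""
    distinct = sorted({w for edges in edge_lists for _, _, w in edges})
    return {w: i + 1 for i, w in enumerate(distinct)}

def subdivide_internal_edges(edges, rank):
    """Replace each internal edge (u, v) by the path u, s, v with a new leaf at s.

    Returns the new (unweighted) edge list and the labels of the new leaves.
    """
    has_child = {u for u, _, _ in edges}
    new_edges, leaf_label = [], {}
    for u, v, w in edges:
        if v in has_child:                       # (u, v) is an internal edge
            s, leaf = ("s", u, v), ("w", u, v)   # hypothetical names for new nodes
            new_edges += [(u, s), (s, v), (s, leaf)]
            leaf_label[leaf] = ("wt", rank[w])   # label depends on the weight only
        else:
            new_edges.append((u, v))
    return new_edges, leaf_label

# Two rooted trees with the same multi-set of edge-weights (example data).
T1 = [("r", "x", 2.0), ("r", "c", 1.0), ("x", "a", 1.0), ("x", "b", 3.0)]
T2 = [("q", "y", 2.0), ("q", "a", 3.0), ("y", "b", 1.0), ("y", "c", 1.0)]
rank = weight_ranks(T1, T2)
R_edges, R_new_labels = subdivide_internal_edges(T1, rank)
```

Tagging the new labels with `"wt"` keeps them apart from the taxon labels of the original leaves.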

In [HKL+04] the partition-labeling is used to identify bad edges in the trees. Here, we show
how to use the labeling to compute pairs of good edges efficiently in parallel.

Given the partition-labelings p and p', we first generate the sorted sequences of labels
p(v_1), ..., p(v_ℓ) and p'(v_1), ..., p'(v_ℓ). Then, for every position i of p such that v_i corresponds
to an edge in the original tree T_1, we activate one processor which performs in O(log n) time a
binary search on the sequence p' in order to check if p(v_i) occurs as a label p'(v_j) in the other
sequence. If v_j corresponds to an edge in the original tree T_2, then these two edges form a good
edge-pair.
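The search step can be sketched sequentially with Python's `bisect` module, where each loop iteration plays the role of one processor. The function name `good_edge_pairs` and the arguments `edge_nodes` and `edge_nodes_prime` (the nodes of R and R' that stand for edges of T_1 and T_2) are assumptions of this example.

```python
from bisect import bisect_left

def good_edge_pairs(p, p_prime, edge_nodes, edge_nodes_prime):
    """Pair nodes of R and R' that represent edges and carry the same label."""
    order = sorted(edge_nodes_prime, key=p_prime.get)   # sort R' side by label
    keys = [p_prime[v] for v in order]
    pairs = []
    for u in edge_nodes:                 # conceptually one processor per node
        i = bisect_left(keys, p[u])      # binary search for the label of u
        if i < len(keys) and keys[i] == p[u]:
            pairs.append((u, order[i]))
    return pairs

# Hypothetical labelings of the edge-representing nodes of R and R'.
p = {"e1": 3, "e2": 5, "e3": 7}
p_prime = {"f1": 5, "f2": 2, "f3": 3}
pairs = good_edge_pairs(p, p_prime, ["e1", "e2", "e3"], ["f1", "f2", "f3"])
```

In this example e1 is matched with f3 and e2 with f1, while e3 has no partner because the label 7 does not occur in the second sequence.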

Altogether we have shown the following theorem. 

Theorem 3. The good edge-pairs between T_1 and T_2 can be identified in O(log n) time on
O(n log n) processors.



4 Summary 

We have designed a new efficient parallel approximation algorithm for the
nearest-neighbor-interchange (nni) distance of weighted phylogenies. Based on the approximation
algorithm of DasGupta et al. [DHJ+00], our algorithm achieves an approximation ratio of O(log n)
and also constructs an associated sequence of nni-operations. For the case that no good edge-pairs
exist, our algorithm runs on a CRCW-PRAM with running time O(log n) and O(n) processors.
Furthermore, we show that the good edge-pairs between two weighted phylogenies can be identified in
O(log n) time on O(n log n) processors.

The most challenging open problem is to settle the question whether this problem is APX-hard. It
would also be interesting to construct new algorithms with a better approximation ratio for this
problem.